COMPLEXITY OF TESTING MORPHIC PRIMITIVITY 



VOJTECH MATOCHA AND STEPAN HOLUB 



Abstract. We analyze the algorithm in [4], which decides whether a given
word is a fixed point of a nontrivial morphism. We show that it can be
implemented to have complexity in $O(m \cdot n)$, where $n$ is the length of the
word and $m$ the size of the alphabet. A visualization of the algorithm can be
found at [7].



1. Introduction 

The word $u = abaaba$ satisfies $f(u) = u$, where $f$ maps $b$ to $aba$ and cancels $a$.
Such words, which are fixed points of a nontrivial morphism, are called morphically
imprimitive. On the other hand, the word $u' = abba$ can easily be verified to be
morphically primitive, which means that the only morphism defined on $\{a, b\}^*$
and satisfying $f(u') = u'$ is the identity.

Fixed points of word morphisms and morphically (im)primitive words are studied,
for example, in [2] and [5]. In [4], the first polynomial algorithm (called
MorphicFactorization) is presented that decides whether a given word $w$ is
morphically primitive. Moreover, given the input word $w$, it finds a
corresponding morphism satisfying $f(w) = w$ with the minimal number of letters
mapped to a nonempty word (that is, not canceled).

The complexity of MorphicFactorization is estimated as $O((m + \log n) \cdot n)$ in
[4]. Here we make a more detailed analysis of the algorithm and improve the estimate
to $O(|E| \cdot n)$, where $E$ is the set of those letters $x$ for which $f(x)$ is nonempty.

2. Definitions 

Let $\mathrm{alph}(w)$ denote the set of letters occurring in $w$ and $|w|$ the length of $w$.
For a set $S \subseteq \mathrm{alph}(w)$, denote by $|w|_S$ the number of all occurrences of
letters from $S$ in $w$; we shorten $|w|_{\{a\}}$ as $|w|_a$.

Each morphism $f$ satisfying $f(w) = w$ induces a factorization of $w$, called a
morphic factorization. The morphic factorization consists of a set $E$ and a sequence
$(w_1, w_2, \dots, w_k)$ such that

• $w = w_1 w_2 \cdots w_k$,

• $|w_i|_E = 1$ for each $i = 1, 2, \dots, k$, and

• if $|w_i|_e = |w_j|_e = 1$ for some $e \in E$, then $w_i = w_j$.
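The three conditions can be checked mechanically. The following Python sketch is only an illustration: the function name `is_morphic_factorization` and the representation of the factorization as a list of strings are assumptions made here, not notation taken from [4].

```python
def is_morphic_factorization(w, expanding, factors):
    """Check the three conditions of a morphic factorization of w."""
    # Condition 1: the factors concatenate to w.
    if "".join(factors) != w:
        return False
    image = {}  # expanding letter -> the factor in which it occurs
    for factor in factors:
        # Condition 2: exactly one occurrence of an expanding letter.
        hits = [c for c in factor if c in expanding]
        if len(hits) != 1:
            return False
        # Condition 3: factors sharing an expanding letter coincide.
        if image.setdefault(hits[0], factor) != factor:
            return False
    return True
```

For the word $u = abaaba$ from the Introduction, the call `is_morphic_factorization("abaaba", {"b"}, ["aba", "aba"])` succeeds, while the attempt `("abba", {"b"}, ["ab", "ba"])` fails on the third condition.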

It is shown in [2] that we can suppose, without loss of generality, that $f$ is
idempotent, that is, $f(a) = f(f(a))$ for each $a \in \mathrm{alph}(w)$. (It is enough to iterate
a general $f$ a sufficient number of times in order to obtain its idempotent version.)
Throughout the paper, we shall therefore assume that $f$ is idempotent. The relation
between $f$ and the corresponding morphic factorization is then as follows: for each
$i = 1, \dots, k$, we have $w_i = f(e)$, where $e$ is the unique letter from $E$ occurring in $w_i$,
and $f(a) = \varepsilon$ if $a \notin E$ (where $\varepsilon$ denotes the empty word). Letters in $E$ are called
expanding. We say that $E$ is a minimal set of expanding letters if no proper subset
of $E$ is the set of expanding letters for a morphism $f'$ satisfying $f'(w) = w$. In [4],
it is shown that all minimal sets of expanding letters have the same cardinality.

Denote the $i$-th letter of $w$ by $w[i]$ and write $w[i \dots j]$, with $i \le j$, to denote
the factor $w[i] w[i+1] \cdots w[j]$ of $w$. We will also work with the set $C_w$ of cuts, that
is, of borders between two consecutive letters (plus the beginning and the end of
$w$). A word $w$ has $|w| + 1$ cuts and we represent them by integers $0, 1, \dots, |w|$. The
cut $k$ is the border following the prefix of length $k$. Note that cuts $i, j$ delimit the
factor $w[i+1 \dots j]$.

Given a word $w$ and a morphism $f$ such that $f(w) = w$, we say that a cut $k$ is a
left cut if it lies in the image of an expanding letter on its left side. More formally,
the cut $k$ is a left cut if $|f(w[1 \dots k])| \le k$. Similarly, we say that $k$ is a right cut
if $|f(w[1 \dots k])| \ge k$. Note that the inequalities are not strict, therefore a cut $k$ can be
both left and right, which happens if and only if $|f(w[1 \dots k])| = k$. Note that the cuts
that are both left and right define the morphic factorization of $w$ induced by $f$.
Writing $w[i,j]$ for the factor delimited by cuts $i$ and $j$, we say that $w[i,j]$ is a
stretch factor if $i$ is a left cut and $j$ is a right cut.
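The definition of left and right cuts translates directly into a single pass over the word. The sketch below is illustrative only; the name `classify_cuts` and the encoding of the morphism as a Python dictionary from letters to strings are assumptions of this example.

```python
def classify_cuts(w, f):
    """Return (left, right): the sets of left and right cuts of w.

    f maps each letter to its image and is assumed to satisfy f(w) == w.
    """
    left, right = set(), set()
    image_len = 0  # |f(w[1..k])| for the current cut k
    for k in range(len(w) + 1):
        if k > 0:
            image_len += len(f[w[k - 1]])
        if image_len <= k:
            left.add(k)
        if image_len >= k:
            right.add(k)
    return left, right
```

For $u = abaaba$ with $f(a) = \varepsilon$ and $f(b) = aba$, the cuts that are both left and right are 0, 3 and 6, which recovers the factorization $aba \cdot aba$.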

An important and natural notion is the neighborhood of a letter $a$ in $w$, denoted
by $n_a$. The neighborhood of $a$ is the longest extension of $a$ which is possible for
all occurrences of $a$ in $w$. It is easy to see that the word $n_a$ contains exactly one
occurrence of $a$, hence it can be written as $n_a = l_a a r_a$.
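As an illustration, the neighborhood can be computed by extending all occurrences of the letter simultaneously until two of them disagree or one of them reaches the end of the word. The function name `neighborhood` is an assumption of this sketch, and the letter is assumed to occur in the word.

```python
def neighborhood(w, a):
    """Return (l_a, r_a), the common left and right context of all
    occurrences of the letter a in w (a is assumed to occur in w)."""
    pos = [i for i, c in enumerate(w) if c == a]  # 0-based occurrences

    def extent(step):
        # Number of letters by which every occurrence can be extended
        # in direction `step` while all occurrences still agree.
        k = 0
        while True:
            idx = [p + step * (k + 1) for p in pos]
            if any(i < 0 or i >= len(w) for i in idx):
                return k
            if len({w[i] for i in idx}) != 1:
                return k
            k += 1

    left, right = extent(-1), extent(+1)
    p = pos[0]
    return w[p - left:p], w[p + 1:p + 1 + right]
```

On the word $caabcaadeaabeaad$ used in Example 1, this returns `("aa", "")` for $b$ and `("", "aa")` for $c$, that is, $n_b = aab$ and $n_c = caa$.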

We will need the following easy observation: 

(1) if $b \in \mathrm{alph}(n_a)$, then $|w|_b \ge |w|_a$.

Letters with minimal frequency in $w$ that occur in a given factor $u$ play a special
role in the algorithm. Therefore, we define

$$\alpha(i,j) = \min\{k \mid i < k \le j,\ |w|_{w[k]} \le |w|_{w[k']} \text{ for all } i < k' \le j\}.$$

In other words, $\alpha(i,j)$ is the leftmost position of a least frequent letter in $w[i,j]$.
Note that "least frequent" is measured with respect to the whole $w$, not just with
respect to $w[i,j]$.
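A direct, non-optimized reading of this definition in Python might look as follows. The name `alpha` mirrors the symbol $\alpha$; positions are 1-based as in the text, and the frequency table is recomputed on every call purely for simplicity.

```python
from collections import Counter


def alpha(w, i, j):
    """Leftmost position k with i < k <= j whose letter w[k] is least
    frequent in the whole word w (positions are 1-based, i < j)."""
    freq = Counter(w)
    # Ties in frequency are broken by the smaller position.
    return min(range(i + 1, j + 1), key=lambda k: (freq[w[k - 1]], k))
```

For $w = caabcaadeaabeaad$ this gives $\alpha(0,16) = 1$ and $\alpha(7,9) = 8$, the values used in Example 1.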

3. Description of the Algorithm 

The algorithm MorphicFactorization is based on the following characterization
of minimal expanding sets (for proofs and more details see [4]):

Let $E$ be a minimal set of letters, and $L$, $R$ minimal sets of cuts satisfying the
following stability conditions:

(A) $\{0, |w|\} \subseteq L$, $\{0, |w|\} \subseteq R$.

(B) Let $w[k] = w[k'] = a$ with $a \in E$. Then

(a) $k - 1 \in L$ and $k \in R$;

(b) $k + |r_a| \in L$ and $k - |l_a| - 1 \in R$;

(c) for each $-|l_a| - 1 \le m \le |r_a|$ we have that

• $k + m \in L$ if and only if $k' + m \in L$, and

• $k + m \in R$ if and only if $k' + m \in R$.

(C) If $i \in L$, $j \in R$ with $i < j$, then $w[\alpha(i,j)] \in E$.

Then $E$ is a minimal set of expanding letters. For $a \in E$, the image $f(a)$ is defined
as

$$f(a) = \mathrm{Image}(a, E, L, R) := w[k - i \dots k + j],$$

where $w[k] = a$; $i \ge 0$ is the smallest integer such that $k - i - 1 \in R$; and $j \ge 0$ is
the largest integer such that

• $k + j \in R$, and

• $k + j' \notin L$ holds for each $j' < j$.
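The definition of the image can be read off as two short scans from an occurrence of the letter. The following sketch assumes that $L$ and $R$ already satisfy the stability conditions; the function name `image` and the use of Python sets for $L$ and $R$ are choices made for this illustration.

```python
def image(w, a, L, R):
    """Image of the expanding letter a, given stable cut sets L and R."""
    n = len(w)
    k = w.index(a) + 1  # 1-based position of the first occurrence of a
    # Smallest i >= 0 with k - i - 1 in R; it exists because 0 is in R.
    i = 0
    while (k - i - 1) not in R:
        i += 1
    # Largest j >= 0 with k + j in R and no left cut k + j' for j' < j.
    best, j = 0, 0
    while k + j <= n:
        if (k + j) in R:
            best = j
        if (k + j) in L:
            break  # any larger j would have a left cut strictly before it
        j += 1
    return w[k - i - 1:k + best]
```

With $w = caabcaadeaabeaad$, $L = \{0, 3, 4, 7, 8, 11, 12, 15, 16\}$ and $R = \{0, 1, 4, 5, 8, 9, 12, 13, 16\}$ (the sets that result from the conditions applied in Example 1), the function returns $aab$ for $b$ and $c$ for $c$.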

The stability conditions guarantee that $f(a)$ is well defined, in particular, that it is
independent of the choice of $k$, and that the resulting morphism satisfies $f(w) = w$.
Moreover, all cuts in $R$ are right cuts of the factorization, and cuts in $L$ are left
cuts. In view of the fact that the sets $E$, $L$ and $R$ represent expanding letters, left
cuts and right cuts respectively, the stability conditions can be rephrased informally as
follows:

(A) the extremal cuts are both left and right; 

(B) (a) an expanding letter is delimited by a left and a right cut; 

(b) the neighborhood of an expanding letter is delimited by a right and a left
cut (the left border is a right cut and vice versa);

(c) neighborhoods of expanding letters are synchronized with respect to 
left and right cuts; 

(C) the leftmost least frequent letter in each stretch factor is expanding. 

The core procedure of the algorithm MorphicFactorization consists in the
construction of sets $E$, $L$ and $R$ satisfying the stability conditions. Given a subset $E$ of
$\mathrm{alph}(w)$, we define subsets $L(E)$ and $R(E)$ of $C_w$ as the smallest sets satisfying the
stability conditions (A) and (B). Similarly, for two subsets $L$ and $R$ of $C_w$, we define
$E(L,R)$ as the smallest subset of $\mathrm{alph}(w)$ satisfying the stability condition (C). We
are looking for a set $E$ satisfying $E = E(L(E), R(E))$. If $E \ne E(L,R)$, then there exist
cuts $i, j$ violating the condition (C), that is, the letter $w[\alpha(i,j)]$ is not an element
of $E$. Denote such a letter by $\mathrm{New}(E, L, R)$. The algorithm is now described by the
following simple pseudocode.

MorphicFactorization(w)

1  E ← ∅; L ← {0, |w|}; R ← {0, |w|};
2  while E ≠ E(L, R)
3      do E ← E ∪ {New(E, L, R)};
4         L ← L(E); R ← R(E);
5  for each a ∈ alph(w)
6      do if a ∈ E
7            then f(a) ← Image(a, E, L, R);
8            else f(a) ← ε;
9  return f;
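The while loop of the pseudocode can be prototyped by computing the closures $L(E)$ and $R(E)$ with a plain fixed-point iteration and by testing condition (C) on all pairs of cuts. The Python sketch below does exactly that. It is a naive illustration with polynomial but far from optimal running time, not the $O(|E| \cdot n)$ implementation analyzed in this paper. The function name `expanding_letters`, the helper names, and the choice of the first violating pair (in increasing order of cuts) as New are assumptions of the sketch.

```python
from collections import Counter


def expanding_letters(w):
    """Naive version of lines 1-4 of the pseudocode; returns (E, L, R)."""
    n, freq = len(w), Counter(w)
    # pos[a] lists the 1-based positions of the letter a in w.
    pos = {a: [k for k in range(1, n + 1) if w[k - 1] == a] for a in freq}

    def arm(a, step):
        """Length of l_a (step = -1) or of r_a (step = +1)."""
        d = 0
        while True:
            idx = [p + step * (d + 1) for p in pos[a]]
            if min(idx) < 1 or max(idx) > n or len({w[i - 1] for i in idx}) > 1:
                return d
            d += 1

    def alpha(i, j):
        return min(range(i + 1, j + 1), key=lambda k: (freq[w[k - 1]], k))

    def closure(E):
        """Smallest L and R satisfying conditions (A) and (B) for E."""
        L, R = {0, n}, {0, n}                      # condition (A)
        arms = {a: (arm(a, -1), arm(a, +1)) for a in E}
        for a, (la, ra) in arms.items():
            for k in pos[a]:
                L.update({k - 1, k + ra})          # (Ba) and (Bb)
                R.update({k, k - la - 1})
        changed = True
        while changed:                             # (Bc): propagate until stable
            changed = False
            for a, (la, ra) in arms.items():
                for m in range(-la - 1, ra + 1):
                    shifted = {k + m for k in pos[a]}
                    for S in (L, R):
                        if shifted & S and not shifted <= S:
                            S.update(shifted)
                            changed = True
        return L, R

    E = set()
    while True:
        L, R = closure(E)
        # Letters violating (C), listed in increasing order of the cuts.
        bad = [w[alpha(i, j) - 1] for i in sorted(L) for j in sorted(R)
               if i < j and w[alpha(i, j) - 1] not in E]
        if not bad:
            return E, L, R
        E.add(bad[0])
```

Since all minimal sets of expanding letters have the same cardinality, the input word should be morphically primitive exactly when the returned set contains every letter of the word; for instance, the sketch returns $\{a, b\}$ for $abba$ and $\{b\}$ for $abaaba$.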

Several examples illustrating the work of the algorithm can be found in [3]. It
can also be tested and visualized at [7]. Here we add one more example, which can
also be understood as a replacement for Example 7 in [4], which is incorrect.

Example 1. Consider $w = caabcaadeaabeaad$, where $n_a = a$, $n_b = aab$, $n_c = caa$,
$n_d = aad$ and $n_e = eaa$. Let us follow the run of the algorithm. At the beginning
we set $E = \emptyset$ and $L = R = \{0, 16\}$. Rounds of the while loop yield the following:

Round 1. • (C) implies $\mathrm{New}(E, L, R) = w[\alpha(0, 16)] = w[1] = c$;

• since $c \in E$, (Ba) implies $0, 4 \in L$, $1, 5 \in R$, and (Bb) implies $3, 7 \in L$,
$0, 4 \in R$; the condition (Bc) is satisfied. This gives $L = \{0, 3, 4, 7, 16\}$ and
$R = \{0, 1, 4, 5, 16\}$.

Round 2. • (C) implies $\mathrm{New}(E, L, R) = w[\alpha(3, 4)] = w[4] = b$;

• since $b \in E$, (Ba) implies $3, 11 \in L$, $4, 12 \in R$, and (Bb) implies
$4, 12 \in L$, $1, 9 \in R$; the condition (Bc) is satisfied. This gives
$L = \{0, 3, 4, 7, 11, 12, 16\}$ and $R = \{0, 1, 4, 5, 9, 12, 16\}$.

Round 3. • (C) implies $\mathrm{New}(E, L, R) = w[\alpha(7, 9)] = w[8] = d$;

• since $d \in E$, (Ba) implies $7, 15 \in L$, $8, 16 \in R$, and (Bb) implies
$8, 16 \in L$, $5, 13 \in R$; the condition (Bc) is satisfied. This gives
$L = \{0, 3, 4, 7, 8, 11, 12, 15, 16\}$ and $R = \{0, 1, 4, 5, 8, 9, 12, 13, 16\}$.

Round 4. • (C) implies $\mathrm{New}(E, L, R) = w[\alpha(8, 9)] = w[9] = e$;

• all conditions (B) are satisfied.

The remaining part of the algorithm MorphicFactorization defines

$$f : a \mapsto \varepsilon,\ b \mapsto aab,\ c \mapsto c,\ d \mapsto aad,\ e \mapsto e.$$

Note that also

$$f : a \mapsto \varepsilon,\ b \mapsto ab,\ c \mapsto ca,\ d \mapsto ad,\ e \mapsto ea,$$

and

$$f : a \mapsto \varepsilon,\ b \mapsto b,\ c \mapsto caa,\ d \mapsto d,\ e \mapsto eaa$$

are possible morphisms with the same set of expanding letters.



4. Complexity analysis 

In this section, we show that the complexity of the algorithm is in $O(m \cdot n)$,
where $n$ is the length of the analyzed word, and $m$ is the number of distinct letters
occurring in it. More precisely, we show that the complexity is in

$$O(|E| \cdot n),$$

where $E$ is a minimal set of expanding letters.

The core of the algorithm is the while loop. The condition $E = E(L, R)$ is checked
$|E| + 1$ times and the loop is performed $|E|$ times since in each round one letter
is added to $E$. Therefore, we have to prove that each round of the loop can be
performed in $O(n)$.

It is convenient to calculate, during the initialization phase, the value of $|w|_a$ for
each $a \in \mathrm{alph}(w)$, and also an array $\mathrm{Pos}[a, i]$, which yields the position of the $i$-th
occurrence of $a$ in $w$. The initialization phase is linear: it is enough to read the
input once.
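A minimal sketch of this initialization in Python, reading the word once, could look as follows. The names `initialize`, `count` and `pos` are assumptions of the example, and $\mathrm{Pos}[a, i]$ corresponds to `pos[a][i - 1]`.

```python
def initialize(w):
    """One pass over w: letter counts |w|_a and occurrence lists Pos[a, .]."""
    count, pos = {}, {}
    for k, a in enumerate(w, start=1):  # k is the 1-based position
        count[a] = count.get(a, 0) + 1
        pos.setdefault(a, []).append(k)
    return count, pos
```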

4.1. Evaluation of the loop condition. Evaluation of the loop condition consists
in checking whether the stability condition (C) is satisfied. If it is not, then the
evaluation also outputs the letter $\mathrm{New}(E, L, R)$. This is done as follows.

Look through cuts $l$ in $L$ in increasing order and for each $l$ find the smallest
cut $r \in R$ strictly larger than $l$, and $k = \alpha(l, r)$. If $w[k] \notin E$, then we have found
$\mathrm{New}(E, L, R)$ and start the next round of the while loop. If $l = n$ and no violation
of (C) was detected, return $E = E(L, R)$.
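The scan over pairs $(l, r)$ can be sketched as follows. The pointer into the sorted right cuts only moves forward, as in the procedure above; however, this sketch recomputes $\alpha(l, r)$ by scanning the factor and sorts the cut sets on every call, so it illustrates the order of the checks rather than the linear running time. The function name `find_new_letter` and the convention of returning `None` when $E = E(L, R)$ are assumptions.

```python
from collections import Counter


def find_new_letter(w, E, L, R):
    """Return a letter violating condition (C), or None if there is none."""
    freq = Counter(w)
    right_cuts = sorted(R)
    t = 0  # index into right_cuts; it never moves backwards
    for l in sorted(L):
        # Advance to the smallest right cut strictly larger than l.
        while t < len(right_cuts) and right_cuts[t] <= l:
            t += 1
        if t == len(right_cuts):
            break  # no right cut lies to the right of l
        r = right_cuts[t]
        # k = alpha(l, r): leftmost least frequent letter of w[l+1 .. r].
        k = min(range(l + 1, r + 1), key=lambda p: (freq[w[p - 1]], p))
        if w[k - 1] not in E:
            return w[k - 1]
    return None
```

On the word of Example 1, successive calls with the sets produced by the stability conditions return $c$, $b$, $d$, $e$ and finally `None`.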

Note that $r$ and $k$ can never decrease, therefore the procedure is in $O(n)$.
However, not all factors $w[i,j]$ with $i \in L$ and $j \in R$ are checked; hence it has to be
shown that the stability condition (C) is verified correctly.

Suppose, for a contradiction, that our procedure outputs $E = E(L, R)$, although
$i \in L$ and $j \in R$ violate the stability condition (C). Assume that $j - i$ is as
small as possible. Let $j' < j$ be the smallest cut in $R$ strictly larger than $i$ and let
$k' = \alpha(i, j')$. Since the stretch factor $w[i, j']$ has been checked by the procedure,
we deduce $w[k'] \in E$ (and $k' \le j'$). On the other hand, by assumption, we have
$w[k] \notin E$, where $k = \alpha(i, j)$. Hence $k' < k$ and $|w|_{w[k]} < |w|_{w[k']}$. The stability condition
(Bb) implies $i' = k' + |r_{w[k']}| \in L$, and we deduce $i' < k$, since the letter $w[k]$ is not
contained in the neighborhood $n_{w[k']}$, by (1).
Clearly, $k = \alpha(i, j) = \alpha(i', j)$, whence the factor $w[i', j]$ violates (C) too, a
contradiction with the minimality of $j - i$.

4.2. Construction of L and R. The construction of the sets $L$ and $R$ in each round
consists in checking the stability condition (B) (the stability condition (A) is
fulfilled by the first line of the algorithm).

The condition (Ba) says that, for a new letter $a \in E$, we have to add positions
immediately before occurrences of $a$ to the set $L$, and positions immediately after
its occurrences to the set $R$. This can be done in $O(|w|_a)$.

Similarly, the condition (Bb) adds starting positions of $n_a$ to $R$, and ending
positions to $L$, where $a$ is a letter newly added to $E$. This requires calculating
$n_a$, which is done as follows. In order to calculate $|r_a|$, check, for growing $k \ge 1$,
whether all letters

$$w[\mathrm{Pos}[a, i] + k], \quad i = 1, 2, \dots, |w|_a,$$

agree, until a mismatch is encountered for $k = |r_a| + 1$. Similarly, with decreasing
$k \le -1$, it is possible to calculate $|l_a|$. The notion of a neighborhood implies that
neighborhoods of different occurrences of the same letter cannot overlap too much;
each position lies in at most two distinct neighborhoods of the same letter: once
in its left part and once in its right part. The number of positions visited during
the calculation is therefore at most $2n$. We conclude that the cost of calculating
$n_a$ and of satisfying (Bb) is in $O(n)$.

The stability condition (Bc) is the most complex one. It can be concisely
described as keeping all neighborhoods $n_a$ of the same letter $a$ from $E$ synchronized.
The underlying structure is an undirected graph with vertices $C_w$ satisfying the
following condition:

cuts $\mathrm{Pos}[a, i] + k$ and $\mathrm{Pos}[a, i'] + k$

are connected for each

$a \in E$, $1 \le i, i' \le |w|_a$ and $-|l_a| - 1 \le k \le |r_a|$.

The condition (Bc) then requires that connected cuts either all are, or all are not
elements of $L$ (of $R$ resp.). In other words, being in $L$ (in $R$ resp.) is a property of
a connected component rather than of an individual cut. We shall represent this
information as a forest of rooted trees of height one. Each cut is linked to its parent,
which is the root representing the connected component. The root also keeps the
information whether the component is in the sets $L$, $R$. Checking whether a cut is in
$L$ (in $R$ resp.) therefore requires constant time.

When a new letter $a$ is added to $E$, new edges synchronizing neighborhoods of
$a$ have to be added too, and the graph becomes more complex. To satisfy the
condition (Bc) as it is formulated in the previous paragraph, it is enough to add
edges

$$(\mathrm{Pos}[a, 1] + k,\ \mathrm{Pos}[a, i] + k)$$

for $i = 2, \dots, |w|_a$ and $-|l_a| - 1 \le k \le |r_a|$. The number of new edges can be
bounded by an argument similar to the one used above when calculating
neighborhoods: each cut is the second vertex of a new edge at most two times. This implies
that the number of new edges is less than $2n$. After new edges have been added,
the algorithm searches the whole graph and compresses the connected components
back to the forest of height one. Since the graph has at most $n$ old edges and at
most $2n$ new ones, this can be done in $O(n)$.
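One possible realization of this step is sketched below in Python. The array-based representation (`parent`, `in_L`, `in_R`, with the flags meaningful only at roots), the function names `sync_edges` and `synchronize`, and the use of breadth-first search are all assumptions of this illustration; the text above only prescribes that the components are searched and compressed back to height one.

```python
from collections import defaultdict, deque


def sync_edges(occurrences, la, ra):
    """Edges (Pos[a,1] + k, Pos[a,i] + k) for i >= 2 and -la-1 <= k <= ra."""
    first = occurrences[0]
    return [(first + k, p + k)
            for p in occurrences[1:] for k in range(-la - 1, ra + 1)]


def synchronize(parent, in_L, in_R, new_edges):
    """Add edges between cuts and recompress every component to height one.

    parent[c] is the root of cut c; in_L and in_R are read at roots only.
    All three lists are updated in place.
    """
    adj = defaultdict(list)
    for c, p in enumerate(parent):          # edges of the old forest
        if p != c:
            adj[c].append(p)
            adj[p].append(c)
    for u, v in new_edges:                  # new synchronizing edges
        adj[u].append(v)
        adj[v].append(u)
    seen = [False] * len(parent)
    for root in range(len(parent)):
        if seen[root]:
            continue
        seen[root] = True
        component, queue = [root], deque([root])
        while queue:                        # breadth-first search
            u = queue.popleft()
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    component.append(v)
                    queue.append(v)
        # Read the flags from the old roots before relinking anything.
        l_flag = any(in_L[parent[c]] for c in component)
        r_flag = any(in_R[parent[c]] for c in component)
        for c in component:
            parent[c] = root
        in_L[root], in_R[root] = l_flag, r_flag
```

Membership of a cut `c` in $L$ is then read as `in_L[parent[c]]`. Both the construction of the adjacency lists and the search touch every vertex and edge a constant number of times, in line with the linear bound claimed above.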

The final definition of $f$ is clearly in $O(n)$, which completes the proof.

5. Conclusion 

We have shown that morphic primitivity can be tested in linear time for a fixed
alphabet. This may be surprising compared with the fact that a similar problem,
checking the existence of a morphism between two distinct words, is NP-complete

(cf. DO). 

If the alphabet is not fixed, the algorithm is at worst quadratic. Consider, for
example, the family of morphically primitive words

$$w_n = a_1 a_2 \cdots a_{n-1} a_n a_n a_{n-1} \cdots a_2 a_1,$$

for which the main loop of the algorithm runs $n/2$ rounds. On the other hand, our
analysis implies that it can be checked in linear time that all letters in $w_n$ have
trivial neighborhoods, whence the morphic primitivity follows. The precise complexity
in the uniform case therefore remains unclear.